library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(jmv)
library(leaflet)
library(plotrix)
library(httr)
library(rvest)
library(sqldf)Women’s Leadership and Carbon Disclosure by The Global Energy Companies
This document is a product of the final project for the STAT 570 lecture, focusing on data handling and visualization tools. It is essential to acknowledge that minor errors may be present, and the methods employed may not necessarily reflect the optimal approach related to the dataset.
Author:
Ekinsu Çiçek - ekinsu.cicek@metu.edu.tr
Kübra Nur Akdemir - kubra.akdemir@metu.edu.tr
Mehmet Ali Erkan - erkan.ali@metu.edu.tr
Oğuzhan Aydın - aydin.oguzhan_01@metu.edu.tr
The largest worldwide polluter and one of the main sources of carbon emissions is the energy sector. When the United Nations introduced the Sustainable Development Goals, they emphasized gender equality and the connection between women’s problems and the climate disaster. However, in the past 10 years, there has been relatively little focus on climate mitigation efforts. Consequently, the primary issues raised by this study are the energy industry, and women and climate change. The impact of women’s leadership on carbon disclosure among the top 100 global energy leaders between 2018 and 2020 was investigated in this study.
Data set in this article contains solid information on the exceptional women directors (EWDs) and carbon information disclosures (CID) of global energy leading companies. The information provided is relevant to the study “The Effect of Women’s Leadership on Carbon Disclosure by the Top 100 Global Energy Leaders” [1]. For the purpose of examining the extent of EWDs and the level of CID, data from 97 companies was obtained through the use of a content analysis technique based on enhanced scoring indicators. The companies’ websites or any related reports, such as sustainability, environmental, annual, or integrated reports, were consulted. The data was collected over a three-year period (2018–2020) for the global energy leading companies that were listed in the Thomson Reuters Database in 2017.
All details about the data such as data collection, compilation and entry into files can be accessed under the name of the data article which is called “Dataset of exceptional women directors and carbon information disclosures of global energy companies” [2]. According to the linked data article, the data includes 97 companies with 291 observations from Thomson Reuters listings. Purposive random sampling was used to choose the companies for the sample based on the data collected. R programming is used in all analysis parts of this study, and the document is created using Quarto[3].
1.Data Descriptions
| File Name | Name of the data in article | File description |
|---|---|---|
| CID Score (Table A).xlsx | Table A | Binary and total CID scores for each company |
| WDs Engagement (Table B).xlsx | Table B | Percentage of WDs’ engagement based on their classifications and percentage on the board |
| EWDs Aggregated Score (Table C).xlsx | Table C | WDs engagement scores marked by aggregated score among the four dimensions of EWDs. |
The data set is consist of three different files. The first one is CD Score which is called the Table A basically shows the CID scores, It includes ratings based on some criteria and gives a total CID score for each year. Table B shows the women directors engagement based on some classifications. All data in the Table B saved as percentages. Lastly, Table C show us the EWDs engagement scores and also gives the director names for each company. In the later steps, the new column will be added in the Table C. Additionally, each file includes three different sheets named 2018, 2019, and 2020, each containing recorded data for the corresponding year. See the data schema visual below:
Since the data was published in October 2023, no detailed study has been done on the data yet. According to the author, this data set can be used to create a more sustainable future, the data is helpful for researchers examining women’s participation in net-zero emissions, gender equality, climate resilience, renewable energy, and energy transition in corporate boardrooms (Majid et all., 2023). For this purpose, some analyzes were applied to better understand the data and draw meaningful conclusions from the data.
2.Reading and Cleaning the Data
The following libraries were used in this study to make the necessary arrangements.
2.1 Necessary Libraries
readxl() is used for reading Excel files into R [4]. Part of the tidyverse, dplyr() is a powerful package for data manipulation [5]. ggplot2 is a popular package for creating static, interactive, and dynamic visualizations in R [6]. plotly() enables the creation of interactive plots and dashboards [7]. jmv() is used for statistical analysis and modelling [8]. leaflet() is a library that creating a interactive maps [9]. plotrix() is used for creating specialized plots and adding features to existing plots [10]. httr() is a package for working http requests, and also making it easier to interact with APIs [11]. rvest() is a web scrapping package in R, used for extracting data from HTML web pages [12]. Lastly, sqldf() allows to perform SQL - like operations on R data frames [13].
2.2 Downloading and Reading the Data
Firstly, local file name is set for the zip file, and download it. As you can see from the link, the data stored in Amazon Web Services. Here you can find the data [14]. Contents are extracted, and defined the names of specific excel files. These files likely contain different data sets related to the project. File names with excel extensions were created for these three different files as CID Scores, WDs Engagement and EWDs Aggregated Score.
#defining the link for the data
zipF <- "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/d2s9yz65mm-4.zip"
#set out the local file name for the zip file
local_zip_filename <- "Dataset of Women Directors Engagement.zip"
#downloading the data
download.file(zipF, local_zip_filename, mode = "wb", method = "auto")
#unzip the file
unzip(local_zip_filename)
#defining the name of the excel files
excel_files <- c("CID Scores (Table A).xlsx", "WDs Engagement (Table B).xlsx", "EWDs Aggregated Score (Table C).xlsx")CID Scores (Table A)
This excel file includes information about binary and total CID scores of the companies between the years of 2018 and 2020. The nine categories and 90 indicators of carbon disclosure procedures are included in Table 1 as a stand-in for CD practices for the sample firms. By assigning a score of 1 for “disclose” and a score of 0 for “not disclose” to each indication, the scoring approach evaluated the degree of CD among the energy leaders. This resulted in scores for each firm as well as an assessment using the content analysis technique. The highest score that may be obtained is 90. At the conclusion of the scoring methodology procedure, the amount of points for exposing any indications in the CD index will be divided.
#defining working directory as the file name
setwd("Dataset of Women Directors’ Engagement and Carbon Information Disclosures of Global Energy Companies")
#reading each sheet in the excel file
CID_2018 <- read_excel("CID Scores (Table A).xlsx", sheet = "FYE 2018")
CID_2019 <- read_excel("CID Scores (Table A).xlsx", sheet = "FYE 2019")
CID_2020 <- read_excel("CID Scores (Table A).xlsx", sheet = "FYE 2020")After the reading sheets, R reads titles as a rows and these column names were deleted. Moreover, there are no column names in the file so that we assigned column names for the tables. For that, instead of the giving column names for each sheets separately, a function that assign all common columns for each sheets was created. So, about the merging phase of these three different sheets, we try to first merge them as a column, but the number of columns has increased a lot, so that new columns was created for each sheets -year column- and combined data(all sheets) by rows with the rbind() in a file, instead of combining columns based. Lastly, the total CID scores of the companies are out of 90 and a different results were obtained for each year so that we combined the newly created year variable with existing total CID scores. In the end, except the company names and year variables, we changed all variables into numeric.
#deleting unnecessary rows
CID_2018 <- CID_2018[-c(1,2),]
CID_2019 <- CID_2019[-c(1,2),]
CID_2020 <- CID_2020[-c(1,2),]
#setting common column names
set_column_names <- function(data, year) {
colnames(data) <- c("company_name", "strategy_policy", "climate_change_opportunities",
"corporate_ghg_emissions_targets", "company_wide_carbon_footprint",
"ghg_emissions_change_over_time", "energy_related_reporting",
"emission_reduction_initiatives_implementation",
"carbon_emission_accountability", "quality_of_disclosure",
paste0("quality_of_DisclosureTotal_cid_scores"),"year")
data$year <- year
return(data)
}
CID_2018 <- set_column_names(CID_2018,"2018")
CID_2019 <- set_column_names(CID_2019,"2019")
CID_2020 <- set_column_names(CID_2020,"2020")
#combining all the tables into one table
CID_scores <- rbind(CID_2018, CID_2019,CID_2020)
#converting variables to numeric
CID_scores <- CID_scores %>%
mutate_at(vars(-company_name, -year), as.numeric)WDs Engagement (Table B)
This file contains information about women board members classification between 2018 and 2020. There are four different predictors in the data set.
(1) The percentage of women board members
(2) The percentage of women board members who are industry experts
(3) The percentage of women board members who act as advisors (ADV
(4) The percentage of women board member who are community leaders(CL) among the board members for each company.
All sheets are read. Like a first data set, titles seems as rows and column names does not seems well.
#defining working directory as the file name
setwd("Dataset of Women Directors’ Engagement and Carbon Information Disclosures of Global Energy Companies")
#reading each sheet one by one
WDs_2018 <- read_xlsx("WDs Engagement (Table B).xlsx", sheet = "FYE 2018")
WDs_2019 <- read_xlsx("WDs Engagement (Table B).xlsx", sheet = "FYE 2019")
WDs_2020 <- read_xlsx("WDs Engagement (Table B).xlsx", sheet = "FYE 2020")First rows of the data set was deleted, and applied a function to change all column names. Moreover, for combining all sheets, new year variables were created and added to data set. All sheets were combined by rows. As you can seen from the R view, as a mention before the data in this file consists of a percentage, when read them, the decimal numbers were too much. To change this, a function was applied. Firstly, with the mutate_at() functions, company_name and year variables were removed since these variables are categorical variables in a way. Thus, percentage sign was removed with sub() function, and all these values convert into numeric with the as.numeric functions and then round function applied to get two decimals.
#deleting uncesessary rows
WDs_2018 <- WDs_2018[-1,]
WDs_2019 <- WDs_2019[-1,]
WDs_2020 <- WDs_2020[-1,]
#setting common column names
set_column_names_2 <- function(data, year) {
colnames(data) <- c("company_name", "number_of_wd", "per_of_wd_on_board",
"per_of_wd_industry_expert",
"per_of_wd_advisors","per_of_wd_community_leader")
data$year <- year
return(data)
}
WDs_2018 <- set_column_names_2(WDs_2018,"2018")
WDs_2019 <- set_column_names_2(WDs_2019,"2019")
WDs_2020 <- set_column_names_2(WDs_2020,"2020")
#combining all the tables in one table
WDs_engagement <- rbind(WDs_2018, WDs_2019, WDs_2020)
#converting columns to numeric (excluding company_name,year)
WDs_engagement <- WDs_engagement |>
mutate_at(vars(-company_name,-year), function(x) round(as.numeric(sub("%", "", x)), 2))EWDs Aggregated Scores (Table C)
The file has the information about EWD’s engagement scores are indicated by the total score across the four different EWD parameters. Compare to the other files, the file also have women director names. There are different binary variables about the director positions that whether the director is expert, adviser or leaders. Moreover, there is a variable about the director years experience as well.
All sheets are read one by one.
#defining working directory as the file name
setwd("Dataset of Women Directors’ Engagement and Carbon Information Disclosures of Global Energy Companies")
#reading each sheet in the excel file
EWDs_2018 <- read_xlsx("EWDs Aggregated Score (Table C).xlsx", sheet = "FYE 2018")
EWDs_2019 <- read_xlsx("EWDs Aggregated Score (Table C).xlsx", sheet = "FYE 2019")
EWDs_2020 <- read_xlsx("EWDs Aggregated Score (Table C).xlsx", sheet = "FYE 2020")
# You can also read these files in a more efficient way:
# years <- 2018:2020
#
# read_ewds_list <- lapply(years, function(year) {
# read_xlsx("EWDs Aggregated Score (Table C).xlsx", sheet = paste("FYE", year))
# })
#
# ewds_all <- do.call(rbind, read_ewds_list)In this data set, name of the columns are similar to name that we assigned. For the easy interpretation, different column names are set and year column are added to the file.
As you can see there is a different issue in the data set compare to previous data. Some company names are empty so we assume the empty places are belong to the company above. When we read the file in R, it’s obvious that R seems these value as NA. Thus, we should replace these empty places with the company names. Initially, we specified how many maximum number of space in a sheet and then it’s known that sheet 2018 includes maximum four empty for a same company name, sheets 2019 includes 5, and sheets 2020 includes 6 maximum space. To fill and the organize these columns, for loops were applied to each different sheets. If a value in the company name column is NA, it replaces the NA value with the previous value in the same column by using lag() function. If value is not NA, it keeps the original value, otherwise, company name. After these procedure, our list file is also ready for the analysis.
#setting common column names
set.column.names <- function(data,year) {
colnames(data) <- c("company_name", "code", "WDsName", "industry_expert",
"advisor", "community_leaders", "energy_experiments", "log_of_energy_experiment")
data$year <- year
return(data)
}
EWDs_2018 <- set.column.names(EWDs_2018,"2018")
EWDs_2019 <- set.column.names(EWDs_2019,"2019")
EWDs_2020 <- set.column.names(EWDs_2020,"2020")
#replacing name in empty space
for (i in 1:4) {EWDs_2018 <- EWDs_2018 |>
mutate('company_name' = ifelse(is.na(company_name), lag(company_name), company_name))}
for (i in 1:5) {EWDs_2019 <- EWDs_2019 |>
mutate('company_name' = ifelse(is.na(company_name), lag(company_name), company_name))}
for (i in 1:6) {EWDs_2020 <- EWDs_2020 |>
mutate('company_name' = ifelse(is.na(company_name), lag(company_name), company_name))}
#combining all the tables into one table
EWDs_scores <- rbind(EWDs_2018,EWDs_2019,EWDs_2020)After that, a categorical variable was added to the data set by generating new variables from data manipulation techniques. It would be a nice way to add titles to managers’ years of energy experience. As a result, we divided the years that managers worked into some classes. We classified managers who had no experience or did not work in the energy sector as “no experience” because they did not have any experience in this field. We evaluated directors with working years up to 15 years as assistant director, those with 15 to 30 years of experience as director, and those with more than 30 years as senior director. We combined this classification under the wd_director column.
#converting energy_experiments variable into numeric
EWDs_scores$energy_experiments <- as.numeric(EWDs_scores$energy_experiments)
#adding new column as wd_title accoring to the energy_experiments
EWDs_scores <- EWDs_scores |>
mutate(wd_title = cut(energy_experiments, breaks = c(-Inf, 0 ,15, 30, 49), labels = c("No Experience","Assistant Director", "Director","Senior Director"), include.lowest = TRUE))3. Descriptive Statistics
CID Scores Table
sm_CID_scores <- CID_scores |> select(-company_name,-year)
summary(sm_CID_scores) strategy_policy climate_change_opportunities corporate_ghg_emissions_targets
Min. :0.000 Min. :0.000 Min. :0.00
1st Qu.:6.000 1st Qu.:5.000 1st Qu.:2.00
Median :7.000 Median :5.000 Median :3.00
Mean :6.485 Mean :4.845 Mean :2.77
3rd Qu.:7.000 3rd Qu.:5.000 3rd Qu.:4.00
Max. :7.000 Max. :5.000 Max. :5.00
company_wide_carbon_footprint ghg_emissions_change_over_time
Min. : 0.000 Min. :0.000
1st Qu.: 8.000 1st Qu.:2.000
Median :10.000 Median :3.000
Mean : 9.364 Mean :2.718
3rd Qu.:12.000 3rd Qu.:3.000
Max. :13.000 Max. :6.000
energy_related_reporting emission_reduction_initiatives_implementation
Min. : 0.000 Min. : 0.00
1st Qu.: 7.000 1st Qu.:17.00
Median : 9.000 Median :18.00
Mean : 8.351 Mean :17.32
3rd Qu.:11.000 3rd Qu.:19.00
Max. :11.000 Max. :20.00
carbon_emission_accountability quality_of_disclosure
Min. :0.000 Min. : 0.00
1st Qu.:6.000 1st Qu.: 8.00
Median :6.000 Median :11.00
Mean :5.763 Mean :10.48
3rd Qu.:6.000 3rd Qu.:14.00
Max. :6.000 Max. :17.00
quality_of_DisclosureTotal_cid_scores
Min. : 0.00
1st Qu.:64.00
Median :71.00
Mean :68.19
3rd Qu.:77.00
Max. :87.00
The collection offers insightful information about how businesses are adapting to climate change, especially when it comes to sustainability efforts and greenhouse gas (GHG) emissions. Interestingly, the average carbon footprint of corporations is 6.485, which highlights the necessity for coordinated efforts in emission mitigation. These footprints vary but are often large. The various GHG emissions objectives, which average 4.845 and show a good trend toward accepting and resolving environmental obligations, demonstrate a commitment to reduction. Moreover, positive trends in the execution of emission reduction programs are evident, as businesses regularly disclose their energy-related operations (average score of 8.351). Strong disclosure quality (average of 10.48) and high accountability scores (average of 5.763) highlight businesses’ accountability for monitoring and openly disclosing carbon emissions. With varied but typically excellent achievements in lowering carbon footprints and adopting transparent reporting procedures, the data set indicates an overall positive trajectory, demonstrating business commitment to sustainability and climate action.
WDs Engagement Table
sm_WDs_engagement <- WDs_engagement |> select(-company_name,-year)
summary(sm_WDs_engagement) number_of_wd per_of_wd_on_board per_of_wd_industry_expert
Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:0.1200 1st Qu.:0.0800
Median :3.000 Median :0.2500 Median :0.2000
Mean :2.763 Mean :0.2377 Mean :0.2054
3rd Qu.:4.000 3rd Qu.:0.3300 3rd Qu.:0.3300
Max. :8.000 Max. :0.6000 Max. :0.5600
per_of_wd_advisors per_of_wd_community_leader
Min. :0.000 Min. :0.0000
1st Qu.:0.080 1st Qu.:0.0000
Median :0.200 Median :0.1000
Mean :0.203 Mean :0.1319
3rd Qu.:0.330 3rd Qu.:0.2150
Max. :0.600 Max. :0.5000
The data set provides information on the distribution of women directors (WD) on corporate boards. It shows that there are, on average, 2.76 WD per board, ranging from 0 to 8. The average proportion of female directors is 23.77%, indicating a range in gender representations across quartiles. The range of female directors is 0% to 60%. Furthermore, around 20.54% of female directors contribute their professional experience to the boards, highlighting the significance of fusing specialized knowledge with gender diversity. In addition, data shows that roughly 20.3% of female directors work as advisers, demonstrating their ability to impact strategic choices. With an average percentage of 13.19%, the data set further emphasizes the importance of female directors as community leaders. All things considered, these results paint a complex picture of gender diversity on boards, including differences in the quantity, level of experience, advising capacities, and leadership positions in the community held by women directors.
EWDs Scores Table
sm_EWDs_scores <-EWDs_scores |> select(energy_experiments,log_of_energy_experiment,wd_title)
summary(sm_EWDs_scores) energy_experiments log_of_energy_experiment wd_title
Min. : 0.000 Min. :0.0000 No Experience : 73
1st Qu.: 2.000 1st Qu.:0.7945 Assistant Director:560
Median : 5.000 Median :1.7918 Director :112
Mean : 9.435 Mean :1.7824 Senior Director : 62
3rd Qu.:13.000 3rd Qu.:2.7081 NA's : 25
Max. :49.000 Max. :3.8918
NA's :25 NA's :98
The data set, which has a mean of 9.435 and ranges from 0 to 49, shows a variety of energy experimentation. In order to provide a compact representation, log-transformed values offer a normalized view (0.0000 to 3.8918, mean log of 1.7824). The most common titles are “Assistant Director,” “Director,” and “Senior Director,” with 73 of them having the designation “No Experience.” Nonetheless, the data set has 25 missing values in the energy experiments and 98 missing log-transformed values. The depth of the information resides in its ability to capture a range of responsibilities and expertise levels; hence, more analysis is necessary to identify any potential relationships between professional titles and energy experiments.
4. Analysis and Research Questions
4.1 Research Questions:
- What are the companies with the highest average quality of disclosure total of CID scores of three years 2018, 2019, 2020?
#finding the mean cid_scores of three years for each company and slicing the highest ten
mean_cid <- CID_scores %>%
filter(year %in% c("2018", "2019", "2020")) %>%
group_by(company_name) %>%
summarize(avg_cid_score = mean(quality_of_DisclosureTotal_cid_scores, na.rm = TRUE)) |>
slice_max(order_by=avg_cid_score,n=10)
mean_cid# A tibble: 10 × 2
company_name avg_cid_score
<chr> <dbl>
1 Hera 83.3
2 Fairmount Santrol 83
3 Encana 80.7
4 Formosa Petrochemical Corporation 80
5 Hellenic Petroleum 79.7
6 Acea SpA 79.3
7 DCC 78.7
8 Hess Corporation 78.7
9 PKN ORLEN 78.7
10 Suncor Energy 78.7
plot_ly(data = mean_cid, x = ~company_name, y = ~avg_cid_score, type = 'bar', color = ~avg_cid_score) %>%
layout(title = "Top 10 Companies by Average Score",
xaxis = list(title = "Company Name", categoryorder = 'total descending'),
yaxis = list(title = "Average Score", range = c(75, 85)))Top 10 companies by average CID scores for three years are Suncor Energy, Acea SpA, Hera, Fairmount Santrol, Encana, Formosa Petrochemical Corporation, Hellenic Petroleum, PKN ORLEN, Hess Corporation, and DCC. Compared to the other companies, company of Hera has the highest average CID score with 83.33. Then, it’s followed by the company of Fairmount Santrol with 83 average CID score. In the top 10 company, the lowest average CID score is 78.67, and there are four company that has the same score. According to the article of Dataset of exceptional women directors and carbon information disclosures of global energy companies (Majid et al., 2023), all of ten companies are at the high CID ranking
- How does the percentage of experience levels of women directors change between the years 2018 and 2020?
library(ggplot2)
ggplot(EWDs_scores, aes(x = wd_title, fill = factor(year))) +
geom_bar(position = position_dodge(width = 0.8), stat = "count", show.legend = TRUE, width = 0.7) +
labs(title = "Change of WD Titles Over Three Years",
x = "WD Title",
y = "Count") +
scale_fill_brewer(palette = "Set3") + # Change the color palette
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels
geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.8), vjust = -0.5)Directors of companies have four titles according to their experienced year. These are director with no experience, assistant director, director, and senior director. According to the plot, companies have assistant director at most in three years by a wide margin. Percentage of these directors, increased in 2020 from 66% to 74% compared to 2019. As percentage of assistant director increased, percentage of women directors with no experience decreased over three years. This is the biggest decrease, from 12% to 3% compared to change of other titles
- Does the CID score decrease on the average as the number of female directors increases?
CID_by_year <- CID_scores |>
group_by(year) |>
summarize(
avg_company_wide_carbon_footprint = mean(company_wide_carbon_footprint),
avg_ghg_emissions_change_over_time = mean(ghg_emissions_change_over_time),
avg_emission_reduction_initiatives_implementation = mean(emission_reduction_initiatives_implementation),
avg_carbon_emission_accountability = mean(carbon_emission_accountability),
avg_quality_of_disclosure = mean(quality_of_disclosure),
avg_quality_of_DisclosureTotal_cid_scores = mean(quality_of_DisclosureTotal_cid_scores))This code group the CID_scores data frame by year and calculate the average of each column for each year. The resulting data frame will have one row for each year and columns for each of the calculated averages.
avg_numberwd <- WDs_engagement |>
group_by(year) |>
summarize(
avg_number_of_wd = mean(number_of_wd))
fig <- plot_ly(data = CID_by_year,x = ~year,
y = ~log(avg_quality_of_DisclosureTotal_cid_scores), name = "Quality of Disclosure Total CID Scores",
type = "scatter",mode = "lines") |>
add_trace(y = ~avg_numberwd$avg_number_of_wd, name = "Average number of Women Directors")
figGrouping the WDs_engagement data frame by year and calculate the average number of women directors for each year. The resulting data frame will have one row for each year and a column for the calculated average. In order to have a better visualization, I take the log of CID scores. Since the range between these two columns are high, we could not clearly see the increases or decreases in the graph. In the plot, it looks like as the number of WD increases, quality of disclosure Total CID scores tend to decrease. However, since we do not have a enough observations in year basis, and we did not apply a statistical tests, this result is not trustable.
- In the article, authors created to some hypothesis about the data set. Here is the some hypothesis:
1-Women board members who are industry experts will positively enhance carbon disclosure.
2-Women board members who act as advisers will positively enhance carbon disclosure.
3-Women board members who serve as community leaders will positively enhance carbon disclosure.
Authors specified that some assumptions may not be logical in sense, but we try to find whether these assumptions are true or not. For the test for all assumptions, the code first calculates the average number of women directors for each year using the WDs_engagement data frame. Then, it creates a plot using plot_ly function from the plotly package. The plot has the year on the x-axis and the logarithm of the average quality of disclosure total CID scores on the y-axis. The plot also has three lines representing the average number of women industry experts, community leaders, and advisers. In short, the plot shows the relationship between the Quality of Disclosure Total CID Scores and the Average number of Women Industry Experts, Average number of Women Community Leader, and Average number of Women Advisers.
avg_numberwd <- WDs_engagement |>
group_by(year) |>
summarize(
avg_number_of_wd = mean(number_of_wd),
avg_industry_expert = mean(per_of_wd_industry_expert),
avg_com_leader = mean(per_of_wd_community_leader),
avg_wd_advisors = mean(per_of_wd_advisors))
fig2 <- plotly::plot_ly(data = CID_by_year,x = ~year,
y = ~log(avg_quality_of_DisclosureTotal_cid_scores), name = "Quality of Disclosure Total CID Scores",
type = "scatter",mode = "lines") |>
add_trace(y = ~avg_numberwd$avg_industry_expert*10, name = "Average number of Women Industry Experts") |>
add_trace(y = ~avg_numberwd$avg_com_leader*10, name = "Average number of Women Community Leader") |>
add_trace(y = ~avg_numberwd$avg_wd_advisors*10, name = "Average number of Women Advisors")
fig2Since the range of the values we are interested in in the WD data set is between 0 and 1 and the range of the CID score is between 0 and 80, it would be difficult to see the breaks in the plot when we print the plot. For this reason, we took the log value of the CID score and multiplied the odds in the WD data set by 10 for better plot awareness. It does not look like there exists a meaningful relationship between the total CID scores and the other variables.
- How are energy companies distributed on the map by country and continent?
It is investigated the distribution of companies around the world both in continental basis and country basis. To do so, firstly, data should be pulled by applying web scraping. Previously, we had company data with longitude and latitude information for the continents.
url_ulke <- "https://developers.google.com/public-data/docs/canonical/countries_csv?hl=en"In the website of google designed for developers, there are available data sets. In one of them, we pulled a data set which includes the country name, longitude and latitude of the countries.
res <- GET(url_ulke)
html_con <- content(res, "text")
html_ulke <- read_html(html_con)First, we define the url of the data set. Then, by read_html function, we read the url in R. However, data is still not readable in r. It is in xml_document xml_node class.
tables <- html_ulke |>
html_nodes("table") |>
html_table() # Extract all tables from the webpageBy html_nodes() function, it is selected all the HTML table elements in the document. Then, html_table() converts the selected HTML tables into data frames.
long_lat_country <- tables[[1]]
long_lat_country <- long_lat_country[,-1]
colnames(long_lat_country)[3] <- "Country"Then, we assigned the first table extracted from the HTML document to the variable long_lat_country. The table includes information about the latitude and longitude of different countries, code and name of the countries. We dropped the country code variable, and edit the column names of the country name.
Moreover, we take the company names, continents and country where the company is located from appendix part in the article. [15]
company_2 <- read.csv('company_2.csv', sep = ";",header = T)Lastly, we combined two data by same column called “Country”. One of the them is web scrapping data, another is the data was taken in the article (company). Now, in addition to its company names, it also includes country, continent and geographical location information. In order to have all the information in company data set, we used all.x=TRUE option in merge function.
combined_data <- merge(x = company_2, y = long_lat_country, by = "Country", all.x = TRUE) df <- sqldf::sqldf("SELECT Country, continent, latitude, longitude, COUNT(*) AS Freq
FROM combined_data
GROUP BY Country
ORDER BY Freq DESC")
df_last <- (sqldf::sqldf("SELECT Freq,A.* FROM df B LEFT JOIN combined_data A ON A.latitude=B.latitude"))To investigate the distribution of companies all around the world, we group them by country. By doing so, we can easily see how many companies there are for each continent. For the better understanding of the data, we order the data by its frequency.
center_lon <- median(df$longitude, na.rm = TRUE)
center_lat <- median(df$lattitude, na.rm = TRUE)
library(leaflet)
getColor <- function(df) {
sapply(df$Freq, function(mag) {
if(mag <= 2) {
"green"
} else if(mag <= 4) {
"orange"
} else {
"red"
} })
}
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = getColor(df)
)The median values of longitude and latitude are calculated in order to arrange the zoom of the map. However, we did not use this because the distances between the continents were too far. In the map, we defined this function to visually see how the number of companies varies from country to country. The cut off values are determined by its quarters.
str(df)'data.frame': 31 obs. of 5 variables:
$ Country : chr "United States" "United Kingdom" "Italy" "India" ...
$ Continent: chr "North America" "Europe" "Europe" "Asia" ...
$ latitude : num 37.1 55.4 41.9 20.6 46.2 ...
$ longitude: num -95.71 -3.44 12.57 78.96 2.21 ...
$ Freq : int 19 8 7 7 5 5 4 4 3 3 ...
leaflet() |>
addProviderTiles(providers$Esri,
options = providerTileOptions(noWrap = TRUE)) |>
addAwesomeMarkers(
data = df_last,
lng = ~longitude,
lat = ~latitude,
label = ~Country,
icon = icons,
popup = ~paste("<br>Number of Company:", Freq,
"<br>Company Names:", company_name),
clusterOptions = markerClusterOptions()
) leaflet() functions initializes a new Leaflet map. addProviderTiles() adds a tile layer from a known map provider. addAwesomeMarkers() adds markers to the map. The markers are created using data from the df_son. Lng and lat specify the longitude and latitude columns in the data frame. Label specifies the label column in the data frame. Icon specifies the icon for the markers. Popup specifies the HTML content for the popups that appear when a marker is clicked. clusterOptions specifies the options for clustering the markers.
5.Conclusion:
This project examines the relationship between women’s leadership and carbon disclosure in the top 100 global energy companies from 2018 to 2020. The study focuses on the energy sector’s role as a major polluter and explores the intersection of gender equality, women’s issues, and climate change, aligned with Sustainable Development Goals. The data set, collected over three years from content analysis of reports from 97 companies, aims to evaluate the influence of women’s leadership on carbon disclosure in these energy companies. R programming is used for analysis, and the results are presented using Quarto. The data set comprises three files: CD Scores (carbon disclosure), WDs Engagement (women directors’ classification), and EWDs Aggregated Scores. The article outlines the process of reading and cleaning each file, addressing challenges like empty company names. It highlights the significance of the data for potential research on gender equality, climate mitigation, and the role of women in the energy sector, emphasizing the use of various R packages for data manipulation and analysis. Overall, the project provides a comprehensive overview of the research’s methodology and the importance of the data set in addressing crucial issues in the energy industry. We generate some research questions. In order to answer these questions, we pull data set from the website. Also, we made some operations within the data and create new variables and sub-data sets. To answer these questions visually, we used plotly, leaflet and ggplot. Moreover, we try to give answer some hypothesis by author and can’t find any meaningful results for the hypothesis. The projects is one of the leading project for this data set and it is open to any feedback, comments and developments.
6. References:
[1] Abd Majid, Nurshahirah, and Amar Hisham Jaaffar. “The Effect of Women’s Leadership on Carbon Disclosure by the Top 100 Global Energy Leaders.” MDPI, Multidisciplinary Digital Publishing Institute, 23 May 2023, www.mdpi.com/2071-1050/15/11/8491.
[2] Nurshahirah Abd Majid a, et al. “Dataset of Exceptional Women Directors and Carbon Information Disclosures of Global Energy Companies.” Data in Brief, Elsevier, 5 Oct. 2023, www.sciencedirect.com/science/article/pii/S2352340923007357.
[3] Quarto, quarto.org/.
[4] Wickham H, Bryan J (2023). readxl: Read Excel Files. https://readxl.tidyverse.org, https://github.com/tidyverse/readxl.
[5] Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.
[6] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
[7] “Plotly.” Plotly r Graphing Library in R, plotly.com/r/.
[8] “The ‘jamovi’ Analyses [R Package Jmv Version 2.4.11].” The Comprehensive R Archive Network, Comprehensive R Archive Network (CRAN), 12 Oct. 2023, cran.r-project.org/web/packages/jmv/index.html.
[9] “Create Interactive Web Maps with the JavaScript ‘leaflet’ Library [R Package Leaflet Version 2.2.1].” The Comprehensive R Archive Network, Comprehensive R Archive Network (CRAN), 13 Nov. 2023, cran.r-project.org/web/packages/leaflet/index.html.
[10] “Package Plotrix.” CRAN, Comprehensive R Archive Network (CRAN), cran.r-project.org/web/packages/plotrix/index.html.
[11] Wickham H (2023). httr: Tools for Working with URLs and HTTP. https://httr.r-lib.org/, https://github.com/r-lib/httr.
[12] Wickham H (2023). rvest: Easily Harvest (Scrape) Web Pages. https://rvest.tidyverse.org/, https://github.com/tidyverse/rvest.
[13] Grothendieck G (2017). sqldf: Manipulate R Data Frames Using SQL. R package version 0.4-11, https://CRAN.R-project.org/package=sqldf.
[14] abd majid, nurshahirah (2023), “Dataset of Women Directors’ Engagement and Carbon Information Disclosures of Global Energy Companies”, Mendeley Data, V4, doi: 10.17632/d2s9yz65mm.4
[15] TY - JOUR AU - Abd Majid, Nurshahirah AU - Jaafar, Amar PY - 2023/05/23 SP - 8491 T1 - The Effect of Women’s Leadership on Carbon Disclosure by the Top 100 Global Energy Leaders VL - 15 DO - 10.3390/su15118491 JO - Sustainability ER -